Approximating Prediction Uncertainty for Random Forest Regression Models
Abstract
The use of machine learning approaches such as random forest has increased for the spatial modeling and mapping of continuous variables. Random forest is a non-parametric ensemble approach and, unlike traditional regression approaches, it provides no direct quantification of prediction error. Understanding prediction uncertainty is important when using model-based continuous maps as inputs to other modeling applications such as fire modeling. Here we use a Monte Carlo approach to quantify prediction uncertainty for random forest regression models. We test the approach by simulating maps of dependent and independent variables with known characteristics and comparing actual errors with prediction errors. Our approach produced conservative prediction intervals across most of the range of predicted values. However, because the Monte Carlo approach was data driven, prediction intervals were either too wide or too narrow in sparse parts of the prediction distribution. Overall, our approach provides reasonable estimates of prediction uncertainty for random forest regression models.

Introduction

Remote sensing scientists have a rich set of standard methods with which the uncertainty of (inherently categorical) thematic maps derived from remotely sensed data can be estimated (e.g., Congalton and Green, 2008). For the most part, resulting uncertainty estimates (a) are independent of the analytical method used for the categorical data analysis, and (b) contain information on category-specific accuracy but not pixel-specific accuracy. Methods with which to estimate the uncertainty of mapped continuous fields are, in contrast, much less standardized. Category-specific accuracy, of course, is no longer relevant, but the means by which the uncertainty of continuous variables is estimated is often tied to the technique used.
Examples abound, including the use of RMSE in classical regression-oriented approaches (Fernandes et al., 2004) and cross-validation-derived PRESS (sum of squares of the prediction residuals) RMSE (Popescu et al., 2004). Cross-validation approaches are also widely used in regression tree analyses of remotely sensed data (Baccini et al., 2007). Cross-validation can estimate many prediction error statistics, including the residual sum of squares. Increasingly, however, cross-validation is used primarily for model selection, and (usually non-parametric) bootstrapping is used once the model is “fixed” (see, e.g., Molinaro, 2005). These methods have been extended to random forest implementations, but the resulting estimates of prediction uncertainty are aggregated (i.e., global) and do not produce the pixel-specific uncertainties required for use in subsequent spatial modeling. The use of machine learning techniques has increased substantially in remote sensing and geospatial data development. For example, Homer et al. (2004) used regression trees for the development of a categorical land cover map for the United States, and Coulston et al. (2012) used random forests to develop a continuous field map of percent tree canopy cover. Other techniques that have been proposed and tested include artificial neural networks, support vector machines, stochastic gradient boosting, and k-nearest neighbor (Moisen and Frescino, 2002; Wieland and Pittore, 2014). Machine learning approaches have become particularly attractive because they are well suited to recognizing patterns in high-dimensional data (Cracknell and Reading, 2014). Further, several of these approaches allow for modeling either categorical or continuous response variables (e.g., random forests, support vector machines/support vector regression).
However, unlike traditional parametric approaches (e.g., multiple regression), information about prediction error (the standard error of a prediction for a new data point) is not readily available. Broad-scale raster maps of continuous variables have been developed for percent impervious surface (Homer et al., 2007), percent tree canopy (Huang et al., 2001; Coulston et al., 2012), forest biomass (Blackard et al., 2008), and forest carbon (Wilson et al., 2013), among other examples. These efforts all relied on machine learning approaches and used either Landsat or MODIS imagery for predictor variables. Each pixel within these modeled raster maps contains a predicted value, yet per-pixel uncertainty is rarely expressed along with the predictions. Understanding pixel-level uncertainty is critical to understanding the utility of the data. Furthermore, many geospatial datasets (such as those mentioned above) are used in subsequent modeling applications. For example, the 2001 NLCD tree canopy cover dataset (Huang et al., 2001) was a major component of forest fire behavior and fuel models (Rollins and Frame, 2006). Clearly, the uncertainty around such a fire behavior model is related to the uncertainty in the underlying data, such as the 2001 NLCD percent tree canopy cover. Our intent is to provide guidance on quantifying prediction uncertainty at the pixel level. While there are numerous machine learning techniques, here we focus on random forest because it is straightforward to train, computationally efficient, and provides stable predictions (Cracknell and Reading, 2014). Random forest is an ensemble method that uses bootstrap aggregating (bagging) to develop multiple models to improve prediction (Breiman, 2001). Along with bagging, random forest also relies on random feature selection to develop a forest of independent CART models. This technique has been used by Powell et al. (2010) and Baccini et al.
(2008) to predict forest biomass, Evans and Cushman (2009) to predict species occurrence probability, Hernandez et al. (2008) to predict faunal species distributions, and Moisen et al. (2012) to predict percent tree canopy cover. Though there have been numerous studies describing and using random forests, there is a lack of information regarding prediction uncertainty. The objective of this study is to develop a technique to approximate prediction uncertainty for random forest models of continuous data. In our case we consider prediction uncertainty to be the uncertainty around a future prediction for a new observation (i.e., pixel-level uncertainty). We further present a case example using predicted percent tree canopy cover in Georgia, USA. Portraying map uncertainty is an important consideration for geospatial data developers. In some cases, prediction uncertainty is a central component of developing a final geospatial dataset. For example, the 2001 and 2011 NLCD percent tree canopy cover layers strove to mask out areas where there is clearly no tree canopy cover but canopy cover models predict low levels of tree canopy cover. In the 2001 NLCD percent tree canopy cover layer, the mask was created by producing a “liberal” tree cover map and hand editing (Huang et al., 2001).

John W. Coulston is with the USDA Forest Service, Southern Research Station, Blacksburg, VA ([email protected]). Christine E. Blinn, Valerie A. Thomas, and Randolph H. Wynne are with Virginia Polytechnic Institute and State University, Department of Forest Resources and Environmental Conservation, Blacksburg, VA.

Photogrammetric Engineering & Remote Sensing, Vol. 82, No. 3, March 2016, pp. 189–197. © 2016 American Society for Photogrammetry and Remote Sensing. doi: 10.14358/PERS.82.3.189
However, the techniques presented here facilitate a more parsimonious approach.

Methods

Throughout the methods section we use standard matrix and bootstrap notation. Bold lower-case letters (e.g., y) represent a vector. Bold upper-case letters (e.g., X) represent a matrix. A * superscript followed by b (e.g., y*b) refers to the bth bootstrap sample, and a * superscript followed by −b (e.g., y*−b) denotes the portion of the original data that was not part of the bth bootstrap sample. Greek letters represent parameters (e.g., τ), and vectors or matrices of parameters are bold as described above.

Random Forest Overview

We provide a brief overview of random forest but point the interested reader to Breiman (2001) for more details. Random forest is an ensemble approach that relies on CART models. The goal of CART is to understand (learn) the relationship between a dependent variable (y) and a set of predictor variables (X). The learning algorithm employs recursive partitioning, in which splits in the X variables are selected to create homogeneous groupings of y. The recursive partitioning continues until either the subset of y at each node has the same value or further splitting adds no improvement. Random forest differs from the CART procedure by (a) employing bootstrap resampling (Efron and Tibshirani, 1993), and (b) using random variable selection. Consider a regression tree, which is made up of splits and nodes. With random forest, a random subset of X variables (selected without replacement) is used to determine the split for each node. Bootstrap resampling is used to develop replicates of the CART model. For continuous variables the ensemble estimate is the mean of the predicted values across trees, denoted ŷ̄, and the variance across trees is var(ŷ).

Methods Overview

Generally speaking, our method to approximate prediction uncertainty for random forest regression models has five main steps (Figure 1).
Step 1 is to fit a random forest model based on all available data. Step 2 is to use bootstrap resampling to parameterize a large number of random forest models (Figure 1B); bootstrap resampling generally leaves approximately 37 percent of the observations out of each sample. Step 3 is, for each bootstrap replicate of the random forest (RF) model, to retain the observed and predicted values for observations not included in the bootstrap sample (Figure 1C); this yields an error assessment dataset. Step 4 is to quantify the properties of the prediction error using the error assessment dataset (Figure 1D). Step 5 is to make a prediction, including error, for a new observation (Figure 1E).

Bootstrap Resampling

The bootstrap is one tool that can be used to approximate the prediction uncertainty of a RF model (Figure 1B). Consider the response and predictor variables (y, X), where a bootstrap sample of (y, X) is (y*b, X*b). Suppose we draw B = 2000 bootstrap samples to create B = 2000 bootstrap datasets. Using these datasets we construct random forest models RF*1, RF*2, ..., RF*2000, and then for each replicate quantify the prediction error for each observation in y*−b based on the corresponding RF*b replicate. The error assessment is constructed for each observation from the distribution of predicted values obtained when the observation was not part of the bootstrap sample (Figure 1C). The prediction error is √MSE, where MSE is the mean squared error for each observation. This technique allows one to quantify prediction error for each element of a holdout dataset, but it does not directly apply to predictions based on a new X. However, because random forest relies on bootstrap sampling to construct the ensemble, a random forest model contains information that we can use to quantify prediction uncertainty for new locations (i.e., where new X data are available; Figure 1D and 1E).
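The bootstrap error-assessment procedure described above can be sketched as follows. This is a minimal illustration, assuming a scikit-learn random forest, synthetic data, and B reduced well below the B = 2000 used here so it runs quickly; all variable names are illustrative, not the authors' implementation.

```python
# Sketch of Steps 1-4: bootstrap-resample the data, refit a random forest on
# each bootstrap sample, and score each observation only in replicates where
# it was left out of the bootstrap sample.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
X = rng.uniform(0, 10, size=(n, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.5, size=n)

# Step 1: random forest fit to all available data.
rf_full = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

B = 50                            # illustrative; the text uses B = 2000
sq_err = [[] for _ in range(n)]   # squared errors per original observation

for b in range(B):
    # Step 2: bootstrap sample (y*b, X*b); ~37% of observations are left out.
    idx = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), idx)
    rf_b = RandomForestRegressor(n_estimators=50, random_state=b)
    rf_b.fit(X[idx], y[idx])
    # Step 3: retain observed and predicted values for the left-out portion.
    for i, p in zip(oob, rf_b.predict(X[oob])):
        sq_err[i].append((y[i] - p) ** 2)

# Step 4: per-observation prediction error, sqrt(MSE) across the replicates
# in which that observation was held out.
pred_error = np.array([np.sqrt(np.mean(e)) for e in sq_err])
```

With B = 50 each observation is held out roughly 18 times, which is enough for the sketch; larger B tightens the per-observation MSE estimate.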
Prediction Uncertainty

In traditional parametric models (e.g., multiple regression), the prediction error for a new observation is a function of the mean squared error (MSE) and the variability in X. Recall that a random forest model is an ensemble of CART models, and the ensemble estimate is the mean across the set of CART model predictions. Each of the CART models is considered a weak learner, and the predictions from these weak learners inherently capture the variability in the relationship between X and y. We can calculate the variance among predictions, var(ŷ), for each observation in X, which represents the variability of predictions among CART models. However, we need to scale between var(ŷ) and (y − ŷ̄)² to approximate prediction uncertainty (Figure 1D), because only var(ŷ) will be available when approximating the prediction uncertainty for a new observation. A measure such as λ = (y − ŷ̄)²/var(ŷ) provides this scaling.
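The quantities involved in this scaling can be computed directly from the ensemble. The sketch below, assuming scikit-learn and synthetic data, compares the among-tree variance var(ŷ) with the realized squared error (y − ŷ̄)² on a holdout set and fits a simple linear scaling through the origin; this particular scaling is an illustrative assumption, not the exact formulation used here.

```python
# On a holdout set, compare var(y_hat) (available for any new X) with the
# realized squared error (y - y_hat_bar)^2 (requires the true y), then
# estimate a scaling usable when only var(y_hat) is known.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(600, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.5, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# One prediction per tree (weak learner) for each holdout observation.
per_tree = np.stack([t.predict(X_te) for t in rf.estimators_])
y_hat_bar = per_tree.mean(axis=0)   # ensemble estimate (mean across trees)
var_hat = per_tree.var(axis=0)      # var(y_hat): variability among trees

sq_err = (y_te - y_hat_bar) ** 2    # realized squared prediction error

# Illustrative least-squares scaling of var(y_hat) toward the squared error,
# fit through the origin.
scale = np.sum(sq_err * var_hat) / np.sum(var_hat ** 2)
approx_pred_var = scale * var_hat
```

Note that ŷ̄ from the per-tree mean equals `rf.predict(X_te)` for a scikit-learn regressor, so only var(ŷ) requires the explicit per-tree pass.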
Similar Articles
Application of ensemble learning techniques to model the atmospheric concentration of SO2
In view of pollution prediction modeling, the study adopts homogeneous (random forest, bagging, and additive regression) and heterogeneous (voting) ensemble classifiers to predict the atmospheric concentration of sulphur dioxide. For model validation, results were compared against widely known single base classifiers such as support vector machine, multilayer perceptron, linear regression and re...
Approximating prediction error covariances among additive genetic effects within animals in multiple-trait and random regression models
A method for approximating prediction error variances and covariances among estimates of individual animals' genetic effects for multiple-trait and random regression models is described. These approximations are used to calculate the prediction error variances of linear functions of the terms in the model. In the multiple-trait case these are indexes of estimated breeding values, and for random ...
Comparison of Random Survival Forests for Competing Risks and Regression Models in Determining Mortality Risk Factors in Breast Cancer Patients in Mahdieh Center, Hamedan, Iran
Introduction: Breast cancer is one of the most common cancers among women worldwide. Patients with cancer may die due to disease progression or other types of events. These different event types are called competing risks. This study aimed to determine the factors affecting the survival of patients with breast cancer using three different approaches: cause-specific hazards regression, subdistri...
Mondrian Forests for Large-Scale Regression when Uncertainty Matters
Many real-world regression problems demand a measure of the uncertainty associated with each prediction. Standard decision forests deliver efficient state-of-the-art predictive performance, but high-quality uncertainty estimates are lacking. Gaussian processes (GPs) deliver uncertainty estimates, but scaling GPs to large-scale data sets comes at the cost of approximating the uncertainty estimat...
The prediction of Persian Squirrel Distribution Using a Combined Modeling Approach in the Forest Landscapes of Luristan Province
Habitat destruction is the most important factor determining species extinction; hence, the management of wildlife populations necessitates the management of habitats. Habitat suitability modeling is one of the best tools used for habitat management. There are several methods for habitat suitability modeling, each having different advantages and disadvantages. In this study, we us...